Exploring COVID-19 Data in the United States

Import data

Read the raw data from an online github respository (https://github.com/nytimes/covid-19-data). It daily updates the latest real-time data of COVID-19 cases and deaths in the USA by compiling data from the New York Time.

#install packages
library(RCurl)
library(dplyr)
library(ggplot2)
library(shadowtext)
library(plotly)

a <- read.csv(text=getURL("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv"),header=T)

Data cleaning and manipulation

Set the baseline as 100 cases for each state. Set time variable as the number of days since the state reached the 100th cases

aa <- a %>%
  as_tibble %>%
  filter(cases > 100) %>%
  group_by(state) %>%
  mutate(days_since_100 = as.numeric(as.Date(date) - min(as.Date(date)))) %>%
  ungroup

#Set the points of case numbers on y-axis
breaks = c(100,200,500,1000,2000,5000,10000,20000,50000,100000,200000,500000,1000000)

Create the cumulative case curve ny state

p <- ggplot(aa, aes(days_since_100, cases, color = state)) +
  geom_line(size = 0.8) +
  geom_point(pch = 21, size = 1) +
  scale_y_log10(expand = expansion(add = c(0,0.1)),
                breaks = breaks, labels = breaks) +
  scale_x_continuous(expand = expansion(add = c(0,1))) +
  theme_minimal(base_size = 14) +
  theme(
    panel.grid.minor = element_blank(),
    legend.position = "none",
    plot.margin = margin(3,15,3,3,"mm")
  ) +
  coord_cartesian(clip = "off") +
  labs(x = "Number of days since 100th case", y= "",
       subtitle = "Total number of COVID-19 cases in USA by state") +
  geom_shadowtext(aes(label = paste0("", state)), hjust=0, vjust=0,
                         data = .%>% group_by(state) %>% top_n(1, days_since_100),
                         bg.color = "white")

p

Create the interactive version of plot

fig <- ggplotly(p)
fig

Explore data and create table

Now I want to know the case fatality rate (CFR) by calculating deaths divided by cases in each state. I create this new variable. Then I want to see which state or area have the maximum, median, and minimum cases among the US. I extract the rows of this data include the latest case and death counts, and CFR. Finally I put them into a table.

#create the new data with the latest case and death counts, and CFR
s <- a %>% 
  group_by(state) %>%
  summarise(latest_cases = max(cases, na.rm=T),
            latest_deaths = max(deaths, na.rm=T),
            latest_CFR = paste0(round(latest_deaths*100/latest_cases, digits=2),'%'))

#Put the 3 rows into a table
summary_table <- 
  s[c(which.max(s$latest_cases),
      which(s$latest_cases==median(s$latest_cases)),
      which.min(s$latest_cases)), ]

summary_table$cases_level <- c('max cases','median cases','min cases')
summary_table <- summary_table[,c(1,5,2:4)]

knitr::kable(summary_table, row.names(c('max cases','median cases','min cases')),align = 'c')
state cases_level latest_cases latest_deaths latest_CFR
California max cases 817939 15791 1.93%
Arkansas median cases 82755 1350 1.63%
Northern Mariana Islands min cases 69 2 2.9%

California has the most cumulative cases with 817939 cases so far. It’s case fatality rate is 1.93%. Northern Mariana Islands has the least cumulative cases with 69 cases so far. It’s case fatality rate is 2.9%.